R4DS 03 - Data transformation with dplyr
The codes below are from R4DS book by Hadley Wickham.
Loading required packages:The nycflights13 dataset contains all 336,776 flights that departed from New York City in 2013. The data comes from US Bureau of Transportation Statistics.
head(flights)
# A tibble: 6 x 19
year month day dep_time sched_dep_time dep_delay arr_time
<int> <int> <int> <int> <int> <dbl> <int>
1 2013 1 1 517 515 2 830
2 2013 1 1 533 529 4 850
3 2013 1 1 542 540 2 923
4 2013 1 1 544 545 -1 1004
5 2013 1 1 554 600 -6 812
6 2013 1 1 554 558 -4 740
# … with 12 more variables: sched_arr_time <int>, arr_delay <dbl>,
# carrier <chr>, flight <int>, tailnum <chr>, origin <chr>,
# dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
# minute <dbl>, time_hour <dttm>
Selecting all flights on Jan 1st:
filter(flights, month == 1, day == 1) # filter month = 1 (Jan) and day = 1 (1st)
# A tibble: 842 x 19
year month day dep_time sched_dep_time dep_delay arr_time
<int> <int> <int> <int> <int> <dbl> <int>
1 2013 1 1 517 515 2 830
2 2013 1 1 533 529 4 850
3 2013 1 1 542 540 2 923
4 2013 1 1 544 545 -1 1004
5 2013 1 1 554 600 -6 812
6 2013 1 1 554 558 -4 740
7 2013 1 1 555 600 -5 913
8 2013 1 1 557 600 -3 709
9 2013 1 1 557 600 -3 838
10 2013 1 1 558 600 -2 753
# … with 832 more rows, and 12 more variables: sched_arr_time <int>,
# arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
# origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
# hour <dbl>, minute <dbl>, time_hour <dttm>
# to filter for flights on xmas day
flights %>%
filter(month == 12, day == 25)
# A tibble: 719 x 19
year month day dep_time sched_dep_time dep_delay arr_time
<int> <int> <int> <int> <int> <dbl> <int>
1 2013 12 25 456 500 -4 649
2 2013 12 25 524 515 9 805
3 2013 12 25 542 540 2 832
4 2013 12 25 546 550 -4 1022
5 2013 12 25 556 600 -4 730
6 2013 12 25 557 600 -3 743
7 2013 12 25 557 600 -3 818
8 2013 12 25 559 600 -1 855
9 2013 12 25 559 600 -1 849
10 2013 12 25 600 600 0 850
# … with 709 more rows, and 12 more variables: sched_arr_time <int>,
# arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
# origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
# hour <dbl>, minute <dbl>, time_hour <dttm>
# A tibble: 55,403 x 19
year month day dep_time sched_dep_time dep_delay arr_time
<int> <int> <int> <int> <int> <dbl> <int>
1 2013 11 1 5 2359 6 352
2 2013 11 1 35 2250 105 123
3 2013 11 1 455 500 -5 641
4 2013 11 1 539 545 -6 856
5 2013 11 1 542 545 -3 831
6 2013 11 1 549 600 -11 912
7 2013 11 1 550 600 -10 705
8 2013 11 1 554 600 -6 659
9 2013 11 1 554 600 -6 826
10 2013 11 1 554 600 -6 749
# … with 55,393 more rows, and 12 more variables:
# sched_arr_time <int>, arr_delay <dbl>, carrier <chr>,
# flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
# air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>,
# time_hour <dttm>
Q1 - Find all flights that had an arrival delay of two or more hours
flights %>%
filter(arr_delay >= 120)
# A tibble: 10,200 x 19
year month day dep_time sched_dep_time dep_delay arr_time
<int> <int> <int> <int> <int> <dbl> <int>
1 2013 1 1 811 630 101 1047
2 2013 1 1 848 1835 853 1001
3 2013 1 1 957 733 144 1056
4 2013 1 1 1114 900 134 1447
5 2013 1 1 1505 1310 115 1638
6 2013 1 1 1525 1340 105 1831
7 2013 1 1 1549 1445 64 1912
8 2013 1 1 1558 1359 119 1718
9 2013 1 1 1732 1630 62 2028
10 2013 1 1 1803 1620 103 2008
# … with 10,190 more rows, and 12 more variables:
# sched_arr_time <int>, arr_delay <dbl>, carrier <chr>,
# flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
# air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>,
# time_hour <dttm>
Q2 - Flew to Houston (IAH or HOU)
# A tibble: 9,313 x 19
year month day dep_time sched_dep_time dep_delay arr_time
<int> <int> <int> <int> <int> <dbl> <int>
1 2013 1 1 517 515 2 830
2 2013 1 1 533 529 4 850
3 2013 1 1 623 627 -4 933
4 2013 1 1 728 732 -4 1041
5 2013 1 1 739 739 0 1104
6 2013 1 1 908 908 0 1228
7 2013 1 1 1028 1026 2 1350
8 2013 1 1 1044 1045 -1 1352
9 2013 1 1 1114 900 134 1447
10 2013 1 1 1205 1200 5 1503
# … with 9,303 more rows, and 12 more variables:
# sched_arr_time <int>, arr_delay <dbl>, carrier <chr>,
# flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
# air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>,
# time_hour <dttm>
Q3 - Were operated by United, American or Delta
# A tibble: 139,504 x 19
year month day dep_time sched_dep_time dep_delay arr_time
<int> <int> <int> <int> <int> <dbl> <int>
1 2013 1 1 517 515 2 830
2 2013 1 1 533 529 4 850
3 2013 1 1 542 540 2 923
4 2013 1 1 554 600 -6 812
5 2013 1 1 554 558 -4 740
6 2013 1 1 558 600 -2 753
7 2013 1 1 558 600 -2 924
8 2013 1 1 558 600 -2 923
9 2013 1 1 559 600 -1 941
10 2013 1 1 559 600 -1 854
# … with 139,494 more rows, and 12 more variables:
# sched_arr_time <int>, arr_delay <dbl>, carrier <chr>,
# flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
# air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>,
# time_hour <dttm>
Q4 - Departed in July, August, and September
# A tibble: 86,326 x 19
year month day dep_time sched_dep_time dep_delay arr_time
<int> <int> <int> <int> <int> <dbl> <int>
1 2013 7 1 1 2029 212 236
2 2013 7 1 2 2359 3 344
3 2013 7 1 29 2245 104 151
4 2013 7 1 43 2130 193 322
5 2013 7 1 44 2150 174 300
6 2013 7 1 46 2051 235 304
7 2013 7 1 48 2001 287 308
8 2013 7 1 58 2155 183 335
9 2013 7 1 100 2146 194 327
10 2013 7 1 100 2245 135 337
# … with 86,316 more rows, and 12 more variables:
# sched_arr_time <int>, arr_delay <dbl>, carrier <chr>,
# flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
# air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>,
# time_hour <dttm>
Q5 - Arrived more than two hours late, but didn’t leave late
flights %>%
filter(arr_delay >= 120, dep_delay <=0)
# A tibble: 29 x 19
year month day dep_time sched_dep_time dep_delay arr_time
<int> <int> <int> <int> <int> <dbl> <int>
1 2013 1 27 1419 1420 -1 1754
2 2013 10 7 1350 1350 0 1736
3 2013 10 7 1357 1359 -2 1858
4 2013 10 16 657 700 -3 1258
5 2013 11 1 658 700 -2 1329
6 2013 3 18 1844 1847 -3 39
7 2013 4 17 1635 1640 -5 2049
8 2013 4 18 558 600 -2 1149
9 2013 4 18 655 700 -5 1213
10 2013 5 22 1827 1830 -3 2217
# … with 19 more rows, and 12 more variables: sched_arr_time <int>,
# arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
# origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
# hour <dbl>, minute <dbl>, time_hour <dttm>
Q6 - Were delayed by at least an hour, but made up over 30 minutes in flight
flights %>%
filter(dep_delay >=60, dep_delay - arr_delay >30)
# A tibble: 1,844 x 19
year month day dep_time sched_dep_time dep_delay arr_time
<int> <int> <int> <int> <int> <dbl> <int>
1 2013 1 1 2205 1720 285 46
2 2013 1 1 2326 2130 116 131
3 2013 1 3 1503 1221 162 1803
4 2013 1 3 1839 1700 99 2056
5 2013 1 3 1850 1745 65 2148
6 2013 1 3 1941 1759 102 2246
7 2013 1 3 1950 1845 65 2228
8 2013 1 3 2015 1915 60 2135
9 2013 1 3 2257 2000 177 45
10 2013 1 4 1917 1700 137 2135
# … with 1,834 more rows, and 12 more variables:
# sched_arr_time <int>, arr_delay <dbl>, carrier <chr>,
# flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
# air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>,
# time_hour <dttm>
Q7 - Departed between midnight and 6am
flights %>%
filter(dep_time <= 600 | dep_time == 2400)
# A tibble: 9,373 x 19
year month day dep_time sched_dep_time dep_delay arr_time
<int> <int> <int> <int> <int> <dbl> <int>
1 2013 1 1 517 515 2 830
2 2013 1 1 533 529 4 850
3 2013 1 1 542 540 2 923
4 2013 1 1 544 545 -1 1004
5 2013 1 1 554 600 -6 812
6 2013 1 1 554 558 -4 740
7 2013 1 1 555 600 -5 913
8 2013 1 1 557 600 -3 709
9 2013 1 1 557 600 -3 838
10 2013 1 1 558 600 -2 753
# … with 9,363 more rows, and 12 more variables:
# sched_arr_time <int>, arr_delay <dbl>, carrier <chr>,
# flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
# air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>,
# time_hour <dttm>
Using the between function to filter flights that departed in summer:
flights %>%
filter(between(month, 7, 9))
# A tibble: 86,326 x 19
year month day dep_time sched_dep_time dep_delay arr_time
<int> <int> <int> <int> <int> <dbl> <int>
1 2013 7 1 1 2029 212 236
2 2013 7 1 2 2359 3 344
3 2013 7 1 29 2245 104 151
4 2013 7 1 43 2130 193 322
5 2013 7 1 44 2150 174 300
6 2013 7 1 46 2051 235 304
7 2013 7 1 48 2001 287 308
8 2013 7 1 58 2155 183 335
9 2013 7 1 100 2146 194 327
10 2013 7 1 100 2245 135 337
# … with 86,316 more rows, and 12 more variables:
# sched_arr_time <int>, arr_delay <dbl>, carrier <chr>,
# flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
# air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>,
# time_hour <dttm>
How many flights have a missing dep_time?
# A tibble: 8,255 x 19
year month day dep_time sched_dep_time dep_delay arr_time
<int> <int> <int> <int> <int> <dbl> <int>
1 2013 1 1 NA 1630 NA NA
2 2013 1 1 NA 1935 NA NA
3 2013 1 1 NA 1500 NA NA
4 2013 1 1 NA 600 NA NA
5 2013 1 2 NA 1540 NA NA
6 2013 1 2 NA 1620 NA NA
7 2013 1 2 NA 1355 NA NA
8 2013 1 2 NA 1420 NA NA
9 2013 1 2 NA 1321 NA NA
10 2013 1 2 NA 1545 NA NA
# … with 8,245 more rows, and 12 more variables:
# sched_arr_time <int>, arr_delay <dbl>, carrier <chr>,
# flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
# air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>,
# time_hour <dttm>
flights %>%
arrange(year, month, day)
# A tibble: 336,776 x 19
year month day dep_time sched_dep_time dep_delay arr_time
<int> <int> <int> <int> <int> <dbl> <int>
1 2013 1 1 517 515 2 830
2 2013 1 1 533 529 4 850
3 2013 1 1 542 540 2 923
4 2013 1 1 544 545 -1 1004
5 2013 1 1 554 600 -6 812
6 2013 1 1 554 558 -4 740
7 2013 1 1 555 600 -5 913
8 2013 1 1 557 600 -3 709
9 2013 1 1 557 600 -3 838
10 2013 1 1 558 600 -2 753
# … with 336,766 more rows, and 12 more variables:
# sched_arr_time <int>, arr_delay <dbl>, carrier <chr>,
# flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
# air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>,
# time_hour <dttm>
flights %>%
arrange(desc(month))
# A tibble: 336,776 x 19
year month day dep_time sched_dep_time dep_delay arr_time
<int> <int> <int> <int> <int> <dbl> <int>
1 2013 12 1 13 2359 14 446
2 2013 12 1 17 2359 18 443
3 2013 12 1 453 500 -7 636
4 2013 12 1 520 515 5 749
5 2013 12 1 536 540 -4 845
6 2013 12 1 540 550 -10 1005
7 2013 12 1 541 545 -4 734
8 2013 12 1 546 545 1 826
9 2013 12 1 549 600 -11 648
10 2013 12 1 550 600 -10 825
# … with 336,766 more rows, and 12 more variables:
# sched_arr_time <int>, arr_delay <dbl>, carrier <chr>,
# flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
# air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>,
# time_hour <dttm>
Using arrange() to sort all the missing values to the start?
flights %>%
arrange(desc(is.na(dep_time), dep_time))
# A tibble: 336,776 x 19
year month day dep_time sched_dep_time dep_delay arr_time
<int> <int> <int> <int> <int> <dbl> <int>
1 2013 1 1 NA 1630 NA NA
2 2013 1 1 NA 1935 NA NA
3 2013 1 1 NA 1500 NA NA
4 2013 1 1 NA 600 NA NA
5 2013 1 2 NA 1540 NA NA
6 2013 1 2 NA 1620 NA NA
7 2013 1 2 NA 1355 NA NA
8 2013 1 2 NA 1420 NA NA
9 2013 1 2 NA 1321 NA NA
10 2013 1 2 NA 1545 NA NA
# … with 336,766 more rows, and 12 more variables:
# sched_arr_time <int>, arr_delay <dbl>, carrier <chr>,
# flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
# air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>,
# time_hour <dttm>
Sort flights to find the most delayed flights:
flights %>%
arrange(desc(dep_delay))
# A tibble: 336,776 x 19
year month day dep_time sched_dep_time dep_delay arr_time
<int> <int> <int> <int> <int> <dbl> <int>
1 2013 1 9 641 900 1301 1242
2 2013 6 15 1432 1935 1137 1607
3 2013 1 10 1121 1635 1126 1239
4 2013 9 20 1139 1845 1014 1457
5 2013 7 22 845 1600 1005 1044
6 2013 4 10 1100 1900 960 1342
7 2013 3 17 2321 810 911 135
8 2013 6 27 959 1900 899 1236
9 2013 7 22 2257 759 898 121
10 2013 12 5 756 1700 896 1058
# … with 336,766 more rows, and 12 more variables:
# sched_arr_time <int>, arr_delay <dbl>, carrier <chr>,
# flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
# air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>,
# time_hour <dttm>
Sort flights that left the earliest
flights %>%
arrange(dep_delay)
# A tibble: 336,776 x 19
year month day dep_time sched_dep_time dep_delay arr_time
<int> <int> <int> <int> <int> <dbl> <int>
1 2013 12 7 2040 2123 -43 40
2 2013 2 3 2022 2055 -33 2240
3 2013 11 10 1408 1440 -32 1549
4 2013 1 11 1900 1930 -30 2233
5 2013 1 29 1703 1730 -27 1947
6 2013 8 9 729 755 -26 1002
7 2013 10 23 1907 1932 -25 2143
8 2013 3 30 2030 2055 -25 2213
9 2013 3 2 1431 1455 -24 1601
10 2013 5 5 934 958 -24 1225
# … with 336,766 more rows, and 12 more variables:
# sched_arr_time <int>, arr_delay <dbl>, carrier <chr>,
# flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
# air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>,
# time_hour <dttm>
Sort flights to find the fastest flights
flights %>%
arrange(air_time) %>%
head()
# A tibble: 6 x 19
year month day dep_time sched_dep_time dep_delay arr_time
<int> <int> <int> <int> <int> <dbl> <int>
1 2013 1 16 1355 1315 40 1442
2 2013 4 13 537 527 10 622
3 2013 12 6 922 851 31 1021
4 2013 2 3 2153 2129 24 2247
5 2013 2 5 1303 1315 -12 1342
6 2013 2 12 2123 2130 -7 2211
# … with 12 more variables: sched_arr_time <int>, arr_delay <dbl>,
# carrier <chr>, flight <int>, tailnum <chr>, origin <chr>,
# dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
# minute <dbl>, time_hour <dttm>
Which flights traveled the shortest?
flights %>%
arrange(air_time) %>%
head()
# A tibble: 6 x 19
year month day dep_time sched_dep_time dep_delay arr_time
<int> <int> <int> <int> <int> <dbl> <int>
1 2013 1 16 1355 1315 40 1442
2 2013 4 13 537 527 10 622
3 2013 12 6 922 851 31 1021
4 2013 2 3 2153 2129 24 2247
5 2013 2 5 1303 1315 -12 1342
6 2013 2 12 2123 2130 -7 2211
# … with 12 more variables: sched_arr_time <int>, arr_delay <dbl>,
# carrier <chr>, flight <int>, tailnum <chr>, origin <chr>,
# dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
# minute <dbl>, time_hour <dttm>
which flights travelled the longest?
flights %>%
arrange(desc(air_time))
# A tibble: 336,776 x 19
year month day dep_time sched_dep_time dep_delay arr_time
<int> <int> <int> <int> <int> <dbl> <int>
1 2013 3 17 1337 1335 2 1937
2 2013 2 6 853 900 -7 1542
3 2013 3 15 1001 1000 1 1551
4 2013 3 17 1006 1000 6 1607
5 2013 3 16 1001 1000 1 1544
6 2013 2 5 900 900 0 1555
7 2013 11 12 936 930 6 1630
8 2013 3 14 958 1000 -2 1542
9 2013 11 20 1006 1000 6 1639
10 2013 3 15 1342 1335 7 1924
# … with 336,766 more rows, and 12 more variables:
# sched_arr_time <int>, arr_delay <dbl>, carrier <chr>,
# flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
# air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>,
# time_hour <dttm>
# Selecting columns by name
flights %>%
select(year, month, day)
# A tibble: 336,776 x 3
year month day
<int> <int> <int>
1 2013 1 1
2 2013 1 1
3 2013 1 1
4 2013 1 1
5 2013 1 1
6 2013 1 1
7 2013 1 1
8 2013 1 1
9 2013 1 1
10 2013 1 1
# … with 336,766 more rows
# Select all columns between year and day
flights %>%
select(year:day)
# A tibble: 336,776 x 3
year month day
<int> <int> <int>
1 2013 1 1
2 2013 1 1
3 2013 1 1
4 2013 1 1
5 2013 1 1
6 2013 1 1
7 2013 1 1
8 2013 1 1
9 2013 1 1
10 2013 1 1
# … with 336,766 more rows
# Select all columns except those from year to day
flights %>%
select(-(year:day))
# A tibble: 336,776 x 16
dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay
<int> <int> <dbl> <int> <int> <dbl>
1 517 515 2 830 819 11
2 533 529 4 850 830 20
3 542 540 2 923 850 33
4 544 545 -1 1004 1022 -18
5 554 600 -6 812 837 -25
6 554 558 -4 740 728 12
7 555 600 -5 913 854 19
8 557 600 -3 709 723 -14
9 557 600 -3 838 846 -8
10 558 600 -2 753 745 8
# … with 336,766 more rows, and 10 more variables: carrier <chr>,
# flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
# air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>,
# time_hour <dttm>
# Select() with everything()
flights %>%
select(time_hour, air_time, everything())
# A tibble: 336,776 x 19
time_hour air_time year month day dep_time
<dttm> <dbl> <int> <int> <int> <int>
1 2013-01-01 05:00:00 227 2013 1 1 517
2 2013-01-01 05:00:00 227 2013 1 1 533
3 2013-01-01 05:00:00 160 2013 1 1 542
4 2013-01-01 05:00:00 183 2013 1 1 544
5 2013-01-01 06:00:00 116 2013 1 1 554
6 2013-01-01 05:00:00 150 2013 1 1 554
7 2013-01-01 06:00:00 158 2013 1 1 555
8 2013-01-01 06:00:00 53 2013 1 1 557
9 2013-01-01 06:00:00 140 2013 1 1 557
10 2013-01-01 06:00:00 138 2013 1 1 558
# … with 336,766 more rows, and 13 more variables:
# sched_dep_time <int>, dep_delay <dbl>, arr_time <int>,
# sched_arr_time <int>, arr_delay <dbl>, carrier <chr>,
# flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
# distance <dbl>, hour <dbl>, minute <dbl>
Variables may also be defined and called upon using all_of() and any_of()
vars <- c("year", "month", "day", "dep_delay", "arr_delay")
flights %>%
select(all_of(vars))
# A tibble: 336,776 x 5
year month day dep_delay arr_delay
<int> <int> <int> <dbl> <dbl>
1 2013 1 1 2 11
2 2013 1 1 4 20
3 2013 1 1 2 33
4 2013 1 1 -1 -18
5 2013 1 1 -6 -25
6 2013 1 1 -4 12
7 2013 1 1 -5 19
8 2013 1 1 -3 -14
9 2013 1 1 -3 -8
10 2013 1 1 -2 8
# … with 336,766 more rows
flights %>%
select(any_of(vars))
# A tibble: 336,776 x 5
year month day dep_delay arr_delay
<int> <int> <int> <dbl> <dbl>
1 2013 1 1 2 11
2 2013 1 1 4 20
3 2013 1 1 2 33
4 2013 1 1 -1 -18
5 2013 1 1 -6 -25
6 2013 1 1 -4 12
7 2013 1 1 -5 19
8 2013 1 1 -3 -14
9 2013 1 1 -3 -8
10 2013 1 1 -2 8
# … with 336,766 more rows
Selecting variables that contain “time”
flights %>%
select(contains("time"))
# A tibble: 336,776 x 6
dep_time sched_dep_time arr_time sched_arr_time air_time
<int> <int> <int> <int> <dbl>
1 517 515 830 819 227
2 533 529 850 830 227
3 542 540 923 850 160
4 544 545 1004 1022 183
5 554 600 812 837 116
6 554 558 740 728 150
7 555 600 913 854 158
8 557 600 709 723 53
9 557 600 838 846 140
10 558 600 753 745 138
# … with 336,766 more rows, and 1 more variable: time_hour <dttm>
flights %>%
select(dep_time, sched_dep_time) %>%
mutate(dep_time_mins = (dep_time %/% 100 * 60 + dep_time %% 100) %% 1440,
sched_dep_time_mins = (sched_dep_time %/% 100 * 60 + sched_dep_time %% 100) %% 1440)
# A tibble: 336,776 x 4
dep_time sched_dep_time dep_time_mins sched_dep_time_mins
<int> <int> <dbl> <dbl>
1 517 515 317 315
2 533 529 333 329
3 542 540 342 340
4 544 545 344 345
5 554 600 354 360
6 554 558 354 358
7 555 600 355 360
8 557 600 357 360
9 557 600 357 360
10 558 600 358 360
# … with 336,766 more rows
flights %>%
mutate((dep_time = dep_time %/% 100*60 + dep_time %/% 100) %% 1440,
arr_time = (arr_time %/% 100*60 + arr_time %% 100) %% 1440,
air_time_diff = air_time - (arr_time + dep_time)) %>%
filter(air_time_diff != 0)
# A tibble: 327,344 x 21
year month day dep_time sched_dep_time dep_delay arr_time
<int> <int> <int> <int> <int> <dbl> <dbl>
1 2013 1 1 517 515 2 510
2 2013 1 1 533 529 4 530
3 2013 1 1 542 540 2 563
4 2013 1 1 544 545 -1 604
5 2013 1 1 554 600 -6 492
6 2013 1 1 554 558 -4 460
7 2013 1 1 555 600 -5 553
8 2013 1 1 557 600 -3 429
9 2013 1 1 557 600 -3 518
10 2013 1 1 558 600 -2 473
# … with 327,334 more rows, and 14 more variables:
# sched_arr_time <int>, arr_delay <dbl>, carrier <chr>,
# flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
# air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>,
# time_hour <dttm>,
# (dep_time = dep_time%/%100 * 60 + dep_time%/%100)%%1440 <dbl>,
# air_time_diff <dbl>
# check if air_time_diff = 0
There are many flights in which there was a difference. This could be because some flights passed midnight, or crossed time zones.
flights %>%
group_by( year, month, day) %>%
summarize(delay = mean(dep_delay, na.rm = T))
# A tibble: 365 x 4
# Groups: year, month [12]
year month day delay
<int> <int> <int> <dbl>
1 2013 1 1 11.5
2 2013 1 2 13.9
3 2013 1 3 11.0
4 2013 1 4 8.95
5 2013 1 5 5.73
6 2013 1 6 7.15
7 2013 1 7 5.42
8 2013 1 8 2.55
9 2013 1 9 2.28
10 2013 1 10 2.84
# … with 355 more rows
To explore the relationship between distance and average delay for each location:
flights %>%
group_by(dest) %>%
summarize(count = n(),
dist = round(mean(distance, na.rm = T), 2),
delay = round(mean(arr_delay, na.rm = T), 2)) %>%
filter(count > 20, dest != "HNL") %>%
ggplot(aes(x = dist, y = delay)) +
geom_point(aes(size = count, alpha = 0.4)) +
geom_smooth(se = F) +
theme_classic() +
theme(legend.position = "none")
https://jrnold.github.io/r4ds-exercise-solutions/ https://r4ds.had.co.nz/
For attribution, please cite this work as
lruolin (2021, April 28). pRactice corner: Data transformation. Retrieved from https://lruolin.github.io/myBlog/posts/20210424_Tidyverse Chap 1 - Data visualization/
BibTeX citation
@misc{lruolin2021data, author = {lruolin, }, title = {pRactice corner: Data transformation}, url = {https://lruolin.github.io/myBlog/posts/20210424_Tidyverse Chap 1 - Data visualization/}, year = {2021} }